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Abstract 

Background: ChlP-seq provides new opportunities to study allele-specific protein-DNA binding (ASB). However, 
detecting allelic imbalance from a single ChlP-seq dataset often has low statistical power since only sequence reads 
mapped to heterozygote SNPs are informative for discriminating two alleles. 

Results: We develop a new method iASeq to address this issue by jointly analyzing multiple ChlP-seq datasets. iASeq 
uses a Bayesian hierarchical mixture model to learn correlation patterns of allele-specificity among multiple proteins. 
Using the discovered correlation patterns, the model allows one to borrow information across datasets to improve 
detection of allelic imbalance. Application of iASeq to 77 ChlP-seq samples from 40 ENCODE datasets and 1 genomic 
DNA sample in GM1 2878 cells reveals that allele-specificity of multiple proteins are highly correlated, and demonstrates 
the ability of iASeq to improve allelic inference compared to analyzing each individual dataset separately. 

Conclusions: iASeq illustrates the value of integrating multiple datasets in the allele-specificity inference and offers a 
new tool to better analyze ASB. 

Keywords: Allele-specific binding, Transcription factor, Histone modification, Data integration, Next-generation 
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Background 

In a diploid organism, each somatic cell has two copies 
of the genome. At certain genomic loci, gene expression, 
DNA methylation, transcription factor (TF) binding or 
histone modification (HM) can be allele-specific. In other 
words, the two alleles can behave differently. These phe- 
nomena, also known as allele-specific expression (ASE), 
allele-specific DNA methylation (ASM) and allele-specific 
binding (ASB, including both allele-specific TF binding 
and allele-specific histone modifications), can contribute 
to phenotypic diversity and may play important roles in 
adaptive evolution [1-3]. Many allele-specific (AS) events 
have been found to correlate with variants in genomic 
sequences [4-11]. Comprehensively characterizing allele- 
specificity therefore can help with linking genotypes to 
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phenotypes. Abnormal AS events have also been linked to 
various diseases [12-15]. For instance, loss of imprinting 
in IGF2 has been associated with increased risk of col- 
orectal cancer [12]. This again highlights the importance 
of studying allele-specificity. 

Early methods for analyzing AS events rely on low- 
throughput technologies such as real time quantitative 
PCR [1]. Later, application of SNP arrays has made the 
AS analysis high- throughput [16-19]. More recently, the 
rapidly evolving high-throughput sequencing technolo- 
gies opened the door to produce digital read-out of AS 
events genome-wide without being constrained by any 
specific array design [5,14,15,20-24]. This brings many 
new opportunities as well as analytical challenges. 

ChlP-seq, a technology that couples chromatin immu- 
noprecipitation with high-throughput sequencing, has 
become the state-of-the-art approach for mapping 
genome-wide TF binding sites and HMs [25-28]. How- 
ever, so far the value of this technology for studying ASB 
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has not been fully utilized. Detecting ASB from a sin- 
gle ChlP-seq dataset often suffers from low statistical 
power. This is because only a small fraction of reads in 
each ChlP-seq sample are mapped to heterozygote SNPs, 
and only these reads are informative for inferring allele- 
specificity. To make the ChlP-seq based ASB analysis 
more useful, it is important to have either experimental or 
analytical innovations to increase the power for detecting 
allele-specificity. 

ChlP-seq data in public domains grow rapidly. A 
recently developed database hmChIP, for instance, has 
compiled over 450 human and mouse ChlP-seq datasets 
representing approximately 2000 samples from 140+ dif- 
ferent TFs and HMs [29,30]. The large volume of data 
provides a new opportunity to improve detection of ASB. 
Conceptually, an integrative analysis of ChlP-seq data for 
different TFs and HMs from the same individual and cell 
type may allow one to discover the synergistic correla- 
tion patterns of allele-specificity among different proteins. 
These correlation patterns can then be utilized to inte- 
grate information from multiple datasets to improve the 
ASB detection. For example, if the allelic imbalance of 
TF A and HM B always co-occur, then analyzing their 
ChlP-seq data jointly will increase the effective number 
of reads available for allelic inference which will then 
increase the statistical power. Unfortunately, existing data 
analysis tools cannot deal with this emerging opportu- 
nity. Methods available for analyzing ASE or ASB using 
the next-generation sequencing data are all designed for 
analyzing one dataset at a time. While a few methods 
are developed for solving problems such as read map- 
ping biases [31], construction of individualized genome 
sequences [32], and combining multiple SNPs in the same 
gene to infer ASE [33], no methods and software tools are 
available for jointly analyzing multiple ChlP-seq datasets 
together to discover synergy patterns of allele-specificity 
among multiple proteins and then use the correlation pat- 
terns to increase the power of ASB detection by borrowing 
information across datasets. 

In this article, we present an integrated solution to this 
problem by developing a new approach, iASeq, for jointly 
analyzing allele-specificity in multiple ChlP-seq datasets. 
iASeq uses a Bayesian hierarchical mixture model to 
describe unknown correlation patterns of allele-specificity 
among multiple datasets. These patterns can be discov- 
ered automatically from the data by fitting the model using 
an Expectation-Maximization (EM) algorithm. Using the 
identified correlation patterns, the model allows one to 
integrate information from multiple datasets to improve 
the ASB detection. Applying this approach, we analyzed 
40 ENCODE [34] ChlP-seq datasets in GM12878 cells, 
representing a total of 77 samples from 34 TFs and HMs. 
The analysis demonstrates the ability of iASeq to auto- 
matically integrate information from multiple datasets to 



significantly improve the detection of allelic imbalance. 
iASeq is implemented as an R package which is freely 
available from Bioconductor [35]. 

Methods 

Data structure 

Suppose there are D ChlP-seq datasets generated using 
cells from the same individual and the same cell type. 
Each dataset d corresponds to one TF or HM, and has Jd 
replicate samples (Figure la). Different datasets represent 
different TFs or HMs, or data generated by different labs. 
For the individual in question, assume one is interested 
in analyzing / heterozygote SNPs with known genotypes. 
We want to know whether the two alleles of each SNP 
behave differently in each dataset, and if possible how 
the AS events are correlated among datasets. For each 
SNP, the allele consistent with the reference genome is 
called the reference allele, and the other allele is called the 
non-reference allele. 

After read mapping and data preprocessing (see 
Additional file 1: Supplemental Methods S.l), we count 
reads for each allele at each heterozygote SNP. For SNP /, 
dataset d and replicate sample let Xtdj and y^j be the read 
counts for the reference allele and non-reference allele 
respectively. Let n^j = %idj + Jidj be the total read count 
(See Figure la for a toy example). Protein-DNA binding 
can be skewed to the reference allele (SR), skewed to the 
non-reference allele (SN), or not allele-specific (NS). We 
use a binary variable bid to indicate whether SNP i is SR 
(bid = 1) or not (bid = 0) m dataset d. If bid — 1> then SNP 
i is assumed to be SR in all replicate samples in dataset 
d. Similarly, we introduce another binary indicator to 
indicate whether SNP i is SN or not in dataset d. bid an d 
Cid cannot be equal to one at the same time. If bid — 0 
and Cid = 0, then SNP i is NS in dataset d. The config- 
uration at each SNP i can be described by two vectors 
Bi = (ha, • • • , b iD ) and Q = (c/i, • • • , c iD ) (See Figure 
Id for a cartoon illustration). Based on these notations, 
(*idj>yidj)> or equivalently (x id p n id j), are the observed data 
for SNP i in sample (d,j), whereas the indicators bid and 
Cid are unobserved. 

Main intuition and challenge 

Our primary goal is to infer for each SNP whether there 
is allelic imbalance in each dataset. This is equivalent 
to inferring bid an d c^. A simple solution to this prob- 
lem is to analyze each individual dataset separately, but 
this approach has low statistical power since the counts 
(xidjt nidj) usually are small. 

If one knows how different datasets are correlated in 
terms of allelic imbalance, this knowledge may be used to 
improve the data analysis. For instance, if the allelic imbal- 
ance of two proteins A and B are closely correlated, then 
observing skewed read counts for protein A will provide 
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Figure 1 The iASeq model, (a) An example of the data structure. Each row represents a SNP and each column corresponds to either the reference 
allele (R) or the non-reference allele (N) read counts from a ChlP-seq sample in a dataset. A dataset could be a TF ChlP-seq experiment or a HM 
ChlP-seq experiment, and can have multiple replicate samples (Rep). iASeq assumes the following data generating process, (b) First, SNPs belong to 
K + 1 classes with different ASB patterns. For each SNP, a class label a\ is randomly assigned according to a class abundance probability vector n. 
Given the class label, a configuration [ b^, c i( j] is generated for each SNP in each dataset according to the probabilistic allele-specificity patterns 
specified by two vectors Vk and Wk- In the figure, the darkness of each cell in 1/ and W represents the probability for b j( j or c,y to be 1. (c) Next, a 
skewing probability p^ is generated for each SNP /', dataset d and replicate sample j based on [b/^QyJ.The distribution of p,yy for NS SNPs in each 
sample follows a Beta distribution (blue lines), p^s for SR SNPs are uniformly distributed in the interval [pdjoA] where p^ 0 is the mean of the 
background Beta distribution (dark blue lines), p^s for SN SNPs are uniformly distributed in the interval [0,p^ 0 ] (light blue lines), (d) Finally, given 
the configuration [b/d,c,y], skewing probability p/^- and a total read count n,yy for SNP /', dataset d and sample j, the read count for each allele is 
generated according to a binomial distribution. The length of the orange bar represents the non-reference allele read count, and the length of the 
red bar represents the reference allele read count. 



information for inferring the allelic imbalance of protein 
B. Integrating the data from both A and B will increase the 
effective number of reads available for statistical inference, 
which will then lead to increased statistical power. 

In reality, how different proteins are correlated is usu- 
ally unknown. However, one may learn it by studying the 



data from many SNPs. Each SNP has three possible states 
in each dataset: SR, SN and NS. For D datasets, there are 
3 d possible configurations in total. From studying many 
SNPs, one can know the relative frequencies (or mixing 
proportions) of these 3 D configurations. The mixing pro- 
portions will tell how different datasets are correlated. For 



Wei etal. BMC Genomics 201 2, 13:681 
http://www.biomedcentral.eom/1 471 -21 64/1 3/681 



Page 4 of 19 



instance, let [51,52, • • • ,sd] be the skewness configuration 
of a SNP in the D datasets. If the mixing proportions for 
three configurations [NS,NS, • • • ,NS\, [SR,SR, • • • ,SR] 
and [ SN, SN,-> , SN] are 0.9, 0.05 and 0.05, then no other 
configurations exist in the data and all datasets are per- 
fectly correlated in terms of the allelic imbalance. In other 
words, at a particular SNP, if one dataset is SR, then all 
the other datasets are also SR. If one is SN, then all the 
others are also SN. On the other hand, if other config- 
urations have non-zero mixing proportions, then not all 
datasets are perfectly correlated, and at a particular SNP, 
one allows the possibility that only a subset of datasets 
are correlated. For instance, if the mixing proportion for 
a configuration [SR,SR,NS, - • • ,NS] is 0.03, then there 
will be 3% of SNPs that are skewed to the reference allele 
in the first two datasets but not skewed in the other 
datasets. Therefore, knowing the mixing proportions of 
all 3 d configurations will tell one the correlation structure 
in the data. This knowledge can then be used to improve 
statistical inference at each individual SNP by facilitat- 
ing information sharing across datasets. For example, if 
the configuration [SR, SR, SN] has a much higher mix- 
ing proportion than [SR, SR, NS], then observing strong 
skewness towards the reference allele of a SNP in the first 
two datasets will imply that, a priori, the SNP is highly 
likely to be skewed to the non-reference allele in the third 
dataset and has much lower probability to be non-skewed 
for both alleles. The principle here is the same as the prin- 
ciple represented by the Bayesian hierarchical models in 
the statistical literature. 

A limitation of this approach is that one has to enumer- 
ate all 3 D AS configurations in order to describe the corre- 
lation. As the number of datasets increases, the number of 
possible configurations increases exponentially. Thus this 
approach does not scale well with the increasing D. Later, 
in our analysis of GM12878 data, D = 40 and 3 D > 10 19 . 
This simple approach is clearly intractable. 

To circumvent the difficulty of documenting the fre- 
quencies of all 3 d configurations, iASeq employs a tech- 
nique that can describe the major correlation patterns 
in the data using a few probability vectors whose val- 
ues vary from 0 to 1 rather than being dichotomous 
(i.e., 0 or 1). This approach significantly reduces the 
model complexity but keeps the flexibility to account 
for all 3 d configurations. It is easily scalable to increas- 
ing dataset number. The correlation structure in the 
model can then be used to improve the statistical infer- 
ence of allelic imbalance at each SNP in each individual 
dataset. 

Probability model 

iASeq is based on the Bayesian hierarchical mixture model 
below that uses several probability vectors to describe 
the major correlation patterns among multiple datasets 



(Figure 1). The model assumes that SNPs can be grouped 
into K + 1 classes with different allele-specificity patterns 
(K 3 d ), and the observed data are viewed as generated 
as follows: 

• First, a class label at is randomly assigned to each 
SNP i according to a probability vector 

7r = (tto, tt\, - - - , 71 k). Here, 7i k = Pr(ai = k) is the 
prior probability to assign a SNP to class k. 
£* *k = 1. 

• If the class label d{ — 0, then B{ = (0, • • • , 0) and 
d = (0, • • • , 0). In other words, all SNPs in class 0 
are background SNPs, and they are NS in all datasets. 
If at = k and k 7^ 0, then SNP i can be skewed, and 
its [ bid; c id] s in different datasets are generated 
independently according to the following 
probabilities: Pribid = 1, Ctf = 0\a t = k) = v k d and 
Pr(bid = 0, Cid = l\ai = k) = Wkd- We assume 

Vkd + w kd < 1> i.e., Pr(bid = 0, c id = 0\a t = k) = 
1 — v kd — w kd > 0. The model implies that each class 
is associated with two vectors of probabilities 
Vk = (v*i, • • • , v kD ) and W k = (w kl , • • • , w kD ). For 
SNPs in class k, B{ and Q are generated according to 
the probabilities in V k and W k . 

• Next, the observed read counts are generated based 
on the AS configurations specified by BiS and Qs. 
Consider SNP i and dataset d If bid = 1> then 
(Xidj, riidj) in each replicate sample (d,j) is generated 
according to a probability distribution 

Pr(x id p n id j\bid = 1, c id = 0) = Pr(n id j\bid = 1, c id = 
0)Pr(xidj\nidj,bid = l,c id = 0) = Pr(n id j)fidj\(xidj)' 
Here we assume that the marginal distribution of riidj 
does not depend on bid an d c/^, and we use^/i i x idj) 
to denote the conditional distribution 
Pr(Xidj\riidj, bid = h Cid = 0). Data in different 
replicate samples are assumed to be generated 
independently. Similarly, if = 1, then (Xidp nidj) s 
are generated according to Pr(%idj, nidj\bid = 0, 
Cid = 1) = Pr(n id j)fidfi(oCidj)< ^ h id = 0 and c id = 0, 
then {%idp nidj)s are generated according to 
Pr(Xidj,nidj\bid = 0,c id = 0) = Pr(n id j)fidjo(Xidj)- 

For SNP i and dataset d, we organize data from all 
replicates j = 1, • • • J d into X id = (x id i, • • • >x id j d ) and 
Nid = (nidi, • • • , Hidj d ). For SNP /, X t = (X a , • • • ,X iD ) 
and Ni = (Nn, • • • ,Ni£>) contain data from all datasets. 
The final observed data are X = (X\, • • • ,Xj) and N = 
(ATi,--- ,Nj) which are the ensemble of data from all 
SNPs. 

Let A = (#!,-•• , ai) be the collection of class member- 
ship indictors of all SNPs, and let B — (B\, • • • ,Bj) and 
C = (Ci, • • • , C/) be the SR and SN indictors for all SNPs. 
A, B and C are the unobserved missing data one wants 
to infer. 
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Organize the probability vectors Vk and Wk from dif- 
ferent classes into two matrices Vk x d = (Vit ' ' ' > Vj<) T 
and W K xD = (Wf , • • • , W%) T . V> W, and the probabil- 
ity vector 7r that describes the class abundance are the 
unknown model parameters. K is assumed to be fixed. 
The choice of K and specification of data generating dis- 
tributions Pr(n idJ ), fidjoixuij), fidjiiXidj) ™d f id j2(x id j) will 
be discussed later. 

Based on this model, each SNP class k {k ^ 0) is asso- 
ciated with two vectors of probabilities Vk and Wk which 
characterize the allelic imbalance preferences in different 
datasets for SNPs belonging to class k. For example, if 
a class has [V k ; W k ] = [(0.8,0.7,0.1,0.1); (0.1,0.1,0.8,0.1)], 
then SNPs in this class have high probability to be SR in 
datasets 1 and 2, and high probability to be SN in dataset 
3, but they have low probability to be allele-specific in 
dataset 4. Since Vk and Wk are probabilities rather than 
0-1 vectors, each class k can generate all 3 D AS con- 
figurations. Therefore, SNPs in the same class are not 
required to have the same AS configuration (e.g., a class 
can have one SNP with configuration [SR,SR,NS,NS] 
while at the same time another SNP with configuration 
[SR,NS,SR,NS]), although they usually have similar AS 
configurations because SNPs in the same class are all 
generated using the same probability vectors. Meanwhile, 
there are K different classes, and each class has a different 
[ Vkt Wk] which specifies a different preference to gen- 
erate the skewing configurations. Thus, whereas SNPs in 
the same class tend to have similar [ Bf, Q] configurations, 
SNPs from different classes tend to have very different 
configurations. Conceptually, this is similar to a model- 
based clustering analysis in which SNPs are grouped into 
K+l clusters based on their [Bf, Q] configurations. How- 
ever, an important difference here is that [Bf,Ci]s are 
unknown. 

Our model assumes that [bi d ;Ci d ]s of the same SNP 
in different datasets are a priori independent condi- 
tional on the class membership However, [bi d ;Ci d ]s 
from different datasets are not independent marginally 
if one integrates out the class label For exam- 
ple, the marginal probability Pr([ b[ d ; Ci d ] = [ 1; 0] ) = 
I2k p r([b id ;c id ] = [l;0]\a i = k)Pr(a t = k) = 
Ylk=i n k v kd' On the other hand, the joint probability 
Pr([^;Q]=[(l,l,...,l);(0,0,...,0)]) = ELi^OL^), 
which is clearly different from the product of the 
marginals \\ d Pr([ b id ; c id ] =[ 1; 0] ) = Yid(Ek=i *kVkd). 
This explains why our model can be used to describe the 
correlation among multiple datasets despite the condi- 
tional independence assumption. Intuitively, if one views 
the model as a clustering analysis of SNPs based on 
[Bf, Q], then each cluster will represent a co-occurrence 
pattern of allele-specificity across multiple proteins. The 
marginal correlation among multiple datasets is described 
by multiple clusters, whereas within each cluster the data 



in different datasets are generated independently. In real 
data, a small K (i.e., a small number of SNP classes) usu- 
ally is sufficient to describe the major correlation structure 
among datasets. Using tt, V and W to describe the cor- 
relation among datasets only requires 0(KD) parameters, 
which is significantly less complex than 0(3 D ) parame- 
ters. At the same time, the iASeq model still provides the 
flexibility to accommodate all 3 D possible [ B[\ Q] con- 
figurations as all of them have non-zero probability to 
occur. 

Data generating distributions 

To fully specify the model, one also needs to specify 
the data generating distributions Pr(x id j,n id j\b id ,c id ) = 
Pr(n idj )Pr(xidj\n idj ,bi d ,Ci d ). The primary goal of iASeq 
is to infer whether two alleles are different. We assume 
that information on allele-specificity is only contained 
in Pr(xi d j\ni d j, bi d , Ci d ), and therefore the exact form of 
Pr(rii d j), i.e., the marginal probability distribution of the 
total read count, is irrelevant for our purpose. As such, 
we mainly focus on modeling the conditional distribution 
of XMj given n^p bid and C[ di i.e., the three distributions 
fidjo(*)>fidji(x) andf idJ2 (x). 

iASeq models these distributions hierarchically in two 
steps. First, Xtdj is assumed to follow a binomial dis- 
tribution Xidj\n idj) pi dj ~ Bin(n id j,p id j), where p idj is the 
probability that a read generated at SNP i in sample 
(d,j) represents the reference allele. Next, we model pi d j 
depending on the values of b[ d and C[ d . 

If bi d = 0 and C[ d = 0, SNP i is NS in dataset d. In 
this case, we assume that pi d j follows a Beta distribu- 
tion Beta(a d j, p d j) with mean p d j 0 = ot d j/(a d j + Pdj)- Note 
that a simpler model for pi d j would be to set it to a con- 
stant p d jo which reflects the background ratio of read 
counts between two alleles. However, previous studies 
have shown that many background SNPs can have pi d j 
slightly different from the average background p d jo even 
though they do not have biologically meaningful allele- 
specificity [33] . As a result, a constant p d jo is not sufficient 
to describe the background variation. For this reason, we 
adopt the Beta distribution to describe pi d j instead of set- 
ting it to a constant (See the blue lines illustrated for 
f(Pidj\bid = 0, Cid = 0) in Figure lc). In the ideal world, the 
mean of the Beta distribution, p d jo, would be equal to 0.5. 
However, in reality p d jo may be slightly different from 0.5 
due to various sources of read mapping biases. For exam- 
ple, allowing the same number of mismatches, reads from 
the reference allele are easier to be mapped back to the ref- 
erence genome than reads from the non-reference allele. 
Therefore, in iASeq p d jo may take values different from 
0.5. Indeed, it is determined by the parameters a d j and /3 d j 
in the Beta distribution which are estimated from the data 
using a moment matching approach (see Additional file 1: 
Supplemental Method S.2). Once estimated, a d j, /3 d j and 
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Pdjo are treated as fixed and known parameters. Based on 
the model for p^, we integrate out all possible values of 
pidj to obtain the distribution of Xidj conditional on bid = 0 
and Cid = 0, which is a beta-binomial distribution: 

fidjoipCidj) = Pr {xidj\n id j> bid = 0, c id = 0) 

= f Pr(Xidj\nidj,Pidj, bid = 0, c id = 0) 
Jo 

x / (pidj\hd = 0, c id = 0) dpidj 

nid ) / p mjtedr l (i- p ) n idr x idj+Pdr l dp 
ij)Jo 



B (a d p Pdj 
C ntdj B { x idj + <*dp n idj ~ %idj + Pdj) 

B (a d p Pdj) 



(1) 



Here is the binomial coefficients "n choose k", and B(., .) 
is the beta function. 

If bid = 1 an d Cid — 0, SNP i is SR in dataset d. In 
this case, we assume that pidj follows a uniform distri- 
bution U[pdjo> l](See the dark blue lines illustrated for 
f(Pidj\bid = hc id = 0) in Figure lc). Hereto = 
a dj/( a dj + Pdj) is defined as above. After integrating out 
Pidj, the distribution of Xidj conditional on bid — 1 an d 
Cid = 0 is 

fidjifridj) = Pr(Xidj\nidj,bid = hc id = 0) 



= / Pr(Xidj\n id j, Pidj, bid = hc id = 0) 
Jo 

*f(Pidj\bid = hc id = 0)dp id j 

(fid; ! 

= p X idj(\-p) n idr x idjdp (2) 

l -Pdj0Jp dj0 

If bid = 0 and = 1, SNP / is SN in dataset 
d, and we assume that pidj follows a uniform distribu- 
tion U[0,pdjo] (See the light blue lines illustrated for 
f(Pidj\bid = 0, c id = 1) in Figure lc). After integrating out 
Pidj, the distribution of Xidj conditional on bid = 0 and 
Cid = 1 is 

fidjiiXidj) = Pr(xidj\riidj,bid = 0,c id = 1) 

= [ f(*idj\mdj>Pidj>bid = o,c id = l) 

Jo 

*f(Pidj\bid = 0,c id = \)dpidj 



r x idj 
PdjO 



PPdjO 

/ P 

Jo 



Pidj (I - p) n idr X idj dp 



(3) 



Joint probabilities and model fitting 

Based on the model above, the complete data likelihood 
can be derived as: 



Pr(X,N,A,B,C\n,V, W) 

= Pr(N)Pr(X,A,B, C\N, it, V, W) 

i 



= Pr(N) [7 Pr{X h a h B h d\N h 7t 7 V 7 W) 



(4) 



i=l 



Define L id0 = I^ti fidjo(*idj)> Ud\ = Ylf=i fidjii^idj) 

and L id 2 = Y^jLifidjiiXidj)- Define 8(.) to be an indica- 
tor function. = 1 if its argument is true, and 8(.) = 0 
otherwise. We have 



Pr{X u a u B u Ci\N h n, V, W) 



= Pr (ai\n) Y\Pr (b id , Cid\a h V,W)Pr(Xid\Ni d , a h b id , c id ) 
d=i 

D ^(^=0) k r D 

*o n L ido n \ nk n ^^i^ 



d=l 



k=l I d=l 



X [(1 - V kd - W k d)L id o] 



.t^ 1 - bid -Cid 



S(ai—k) 



(5) 



To infer tt, V and W, we employ a Bayesian approach by 
imposing a Dirichlet prior D(rj, • • • , rj) on Jt and impos- 
ing independent Dirichlet priors D(rj, rj, rj) on all triplets 
(vkd> w kd> 1 — v kd — w kd)- The joint posterior distribution 
of unknown parameters and indicators given the observed 
data is: 



Pr(AAC,n,V,W\X,N) 

oc Pr (X,N,A,B, C\n, V, W) f (tt, V, W) 



oc Y\Pr(Xi,ai,Bi,Ci\Ni,7t,V,W) 



i=l 



k=0 J 



K D 



x \ Y\Y\ v ld lw ld l ^-v kd -w kd ) 
y k =id=i 



(6) 



Conditional on the observed data, Pr(N) is a constant 
that does not contain parameters of interest, therefore it is 
absorbed into a proportionality constant not shown in the 
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formula above. Using this joint posterior, an EM algorithm 
can be derived to search for posterior mode (tt, V, W) of 
Pr(jr, V, W\X,N) = Y,A,B,c Pr ( A > B > v > W\X,N) in 
which the missing indictors A, B and C are all integrated 
out (see Additional file 1: Supplemental Method S.4). 

For the Dirichlet prior, we use r\ = 2 (see Additional 
file 1: Supplemental Method S.3 for a discussion on the 
choice of parameter for the Dirichlet prior). In the EM 
algorithm, we assume that the class number K is given. 
In order to choose the optimal 7<T, we run the algorithm 
multiple times using different values of K. We choose the 
best K using the Bayesian Information Criterion (BIC) 
(see Additional file 1: Supplemental Method S.5). 

Statistical inference of allele-speciflcity 

The estimated tt, V and W can describe the corre- 
lation patterns of allele-specificity among datasets. 
Given tt, V and W, one can infer whether SNP i 
belongs to class k based on the posterior probability 
Pr(ai = k\X h Ni,Ti, V, W) (see Additional file 1: Supple- 
mental Method S.4 equations S.12-S.13). One can 
then infer whether each SNP i is skewed in each 
individual dataset d based on the posterior probabil- 
ity Pr(bid, cui\Xi,Ni, 7t,V,W) = J2cn Pr ( a i> b id> c i( i\X h N h 
tt, V, W) after summing over all possible values of d{ 
(see Additional file 1: Supplemental Method S.4 equation 
S.14). Note that 

Pr{b id ,c id \Xi,Nun,V,W) = 

Y^Pr(a i = k\X h N h jt > V > W)Pr(b id) c id \a i = k ) X h N h jt > V, W) 

k 

(7) 



Define 

Pid = max J Pr (b id = 1, c id = 0\X h N h jt, V, W) , 

Pr (b id = 0, = 1 \X h Ni, n,V,W) • 

(8) 

Using Pi df SNPs can be rank ordered for biologists to 
choose candidates to design follow-up studies. For each 
top ranked SNP, one can determine its skewing direction 
by comparing Pribid = 1, = 0\Xi,Ni, it, V, W) and 
Pribid = 0,c id = l\Xi,Ni,7t, V, W). The one with the 
larger value determines the direction. Finally, the poste- 
rior probabilities of top N SNPs can be converted to an 
estimate of false discovery rate (FDR) using FDR(N) = 

Y.ietop N SNPs(l ~ P id)/N. 

Formula 7 shows that two types of informa- 
tion contribute to Pr(b id , c id \X h N h tt, V, W): (1) 
Pr(ai = k\Xi,Ni,7t, V, W), which is determined using 



information from all D datasets, and (2) Pr(bid,Cid\cii = 
k,Xi,Ni,Jt, V, W), which only uses information specific 
to dataset d conditional on tt, V and W. Thus for each 
particular dataset d, the dataset-specific information is 
weighted by information obtained from other datasets to 
determine the SNP ranking. Intuitively, if allelic imbalance 
in two datasets are correlated, then observing an AS event 
in one dataset will suggest that a relatively weak skewing 
event observed at the same SNP in the other dataset is 
very likely to be a true AS event. In contrast, if no AS 
event is observed in one dataset, then a relatively weak 
skewing event observed at the same SNP in the other 
dataset is likely to be a false positive. This is the underly- 
ing nature of using Pr(at = k\Xi,Ni, tt, V, W) to re-weigh 
information in Pr(bi d ,Cid\ai = k,Xi,Ni,n, V, W), and it 
provides the foundation for improving SNP ranking by 
borrowing information across datasets. In real applica- 
tions, 7r, V, W are unknown, and they are replaced by the 
posterior mode obtained from the EM algorithm. 

Results 

GM12878 data and preprocessing 

We collected 40 ENCODE [36] ChlP-seq datasets with a 
total of 77 samples together with a genomic DNA sample 
in GM12878 lymphoblastoid cells (Additional file 2: Table 
SI). GM12878 is a female and is one of the most exten- 
sively studied cell lines in ENCODE. Within each dataset, 
the number of replicate samples varied from 1 to 3. We 
downloaded the raw sequence reads of all 78 samples and 
mapped them to human genome (hgl8) (see details in 
Additional file 1). We removed repeated sequences from 
the ChlP-seq datasets to avoid PCR duplicates, which may 
skew the determination of allelic biases. In other words, 
if multiple reads have exactly the same sequence, only 
one copy is retained. We obtained the genotype data for 
GM12878from [37]. 

As previously described in [31], there are two different 
types of read mapping biases that may affect the analy- 
sis of AS events: the reference bias and the inherent bias. 
The reference bias often occurs when one maps sequence 
reads to a reference genome. If one allows the same num- 
ber of mismatches in the alignment, a read from the 
non-reference allele is less likely to be mapped back to 
the reference genome compared to a read from the ref- 
erence allele, since the non-reference read has one more 
mismatch to the reference genome. This phenomenon is 
known as the reference bias. This type of bias, if it exists, 
is automatically taken care of by the iASeq model through 
the parameter p d j 0 which models the background skew- 
ing probability and is estimated using all reads mapped 
to heterozygote SNPs in each sample. If there is reference 
mapping bias, p d jo will take a value different from 0.5 to 
adjust for the bias. One may remove reference bias before 
the analysis by masking SNPs in the reference genome 
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during the alignment or by aligning reads to a diploid 
personal genome. This situation will also be automatically 
recognized by iASeq through the estimation of p^jo from 
the data (if there is no bias,^/o = 0.5). Therefore, regard- 
less of whether the reference bias has been removed from 
the data in the preprocessing or not, the iASeq model is 
able to automatically handle it and adjust the inference 
accordingly. 

The intrinsic bias is a different type of bias. As shown 
by [31], even if the reference bias is removed (e.g., by 
masking SNPs in the reference genome), the inherent 
bias still exists. For example, suppose sequence 1 (e.g., 
xxxAxxx) and sequence 2 (e.g., xxxTxxx) are two reads 
that differ only in one position (i.e., A/T). It is possi- 
ble that sequence 1 is easier to be mapped back to its 
correct location in the genome than sequence 2 if the 
second sequence has many repeats in the genome. This 
bias reflects the inherent characteristics of the genome 
and cannot be removed by masking variants in the refer- 
ence genome or by mapping reads to a diploid personal 
genome. In the above example, masking A and T in the 
original reads is also not a solution, since a priori one 
does not know which position in a read corresponds to 
a SNP position and therefore should be masked without 
first aligning the read to the genome. When a heterozy- 
gote SNP has inherent bias, one allele will have higher read 
counts than the other even if the two alleles have the same 
binding level. To avoid this bias, we used the approach 
described in [22,31] to remove SNPs with the inherent 
bias. 

We began with 1,704,166 heterozygote SNPs and fil- 
tered out 149,996 (8.8%) SNPs with inherent bias. 
Next, we eliminated SNPs that were not bound by 
any TF or associated with any HM in any dataset 
(see Additional file 1: Supplemental Methods S.l.l, 
S.1.2 and Additional file 3: Table S2 for details). 
After applying these niters, 94,519 heterozygote SNPs 
remained. These 94,519 SNPs were then analyzed by 
iASeq. 

A simulation study 

Before we apply iASeq to the real data, we first tested 
its performance in simulations that took into account 
real data characteristics. Our simulations kept the same 
design as the real GM12878 ChlP-seq data, with the same 
number of datasets and the same number of replicates 
within each dataset, except that the genomic DNA sam- 
ple was not used here since we knew the truth in the 
simulations and did not need genomic DNA as a con- 
trol for potential bias. To create the simulation data, we 
first applied iASeq to the real GM12878 data to identify 
86,353 SNPs that were not skewed in any dataset using 
Priai — 0\Xi,Ni, n, V, W) > 0.5 as cutoff. To mimic 
the real background noise, these SNPs were resampled by 



a bootstrap procedure to create the background SNPs in 
the simulations, and we kept the read counts (Xidj, nidj) of 
each background SNP as is in the simulated data. Next, 
we simulated ASB SNPs and added them to the back- 
ground. Simulations were carried out under two different 
scenarios (Figure 2). 

• Scenario 1: Two types of ASB SNPs (classes 1 and 2) 
were created in addition to the background SNPs 
(class 0). The SNP number for class 0, 1, and 2 was 
85,069, 4,725 and 4,725 respectively. Thus the true n k 
for the three classes was 0.90, 0.05 and 0.05 
respectively. SNPs in class 1 were SR in datasets 1 to 
30 (i.e., their b id = 1 for d = 1, • • • , 30). SNPs in class 
2 were SN in datasets 1 to 30 (i.e., = 1 for 

d = 1, • • • , 30). In datasets 31 to 40, no SNPs had 
allelic imbalance. Class 2 can be viewed as the mirror 
image of class 1. This symmetric design reflects the 
symmetry of allele-specificity, that is, the skewing to 
the reference allele and to the non-reference allele is 
approximately symmetric. The class abundance 
(0.90,0.05,0.05) roughly matched the abundance 
observed in the analysis of real GM12878 data. 

• Scenario 2: Four correlation patterns (classes 1-4) 
were created in addition to the background class 
(class 0). Class 1 and class 2 were the same as in 
simulation 1. Classes 3 and 4 were two new patterns. 
SNPs in class 3 were SR in datasets 21-40, and SN in 
datasets 1-10. Class 4 was the mirror image of class 3. 
The abundance of the classes 0 to 4 was 
(0.90,0.025,0.025,0.025,0.025). 

Given the simulated [ Bf, Q] configurations, we then 
simulated the read count data for ASB SNPs as described 
in detail in Additional file 1: Supplemental Methods S.6. 
Simulations done in this way was able to keep the major 
characteristics of real data while allowing us to benchmark 
the performance of different methods since we knew the 
truth. 

We applied iASeq to both simulations. In both cases, 
iASeq was able to identify the correct number of SNP 
classes using BIC (Figures 2a,b,d,e). Figures 2c and 2f 
show that the ASB patterns reported by iASeq matched 
the true patterns well. In order to test whether iASeq 
can improve the statistical power of detecting SNPs with 
allelic imbalance, we compared the SNP ranking provided 
by iASeq with rankings provided by five other meth- 
ods that analyze each dataset separately (Figure 3). In 
iASeq, SNPs were ranked in each dataset d according 
to the posterior probability defined by Formula 8. 
Since we know the truth, we can count how many of 
the top N SNPs were true positives. Here the true posi- 
tives were defined as SNPs that were truly allele-specific 
and also had the skewing direction correctly inferred. 
The five single-dataset based methods for ranking SNPs 
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Figure 2 Simulation design and patterns discovered by iASeq. (a) The true ASB patterns in simulation 1 . Two patterns were simulated in 
addition to the background pattern. The two non-background patterns are shown. Each pattern has 4725 SNPs. Each row in the plot represents a 
SNP class, and each column represents a dataset. Black means skewed, and white means not skewed, (b) The BIC values for different class number K 
in simulation 1 . The BIC achieves the minimum at K = 2. (c) Patterns discovered by iASeq in simulation 1 . The plot shows the estimated V and W 
when K = 2. Each row corresponds to a class. Each column represents a dataset. The color in the cell (k, d) demonstrates the estimated SR or SN 
probability in class k and dataset d. From white to dark, the probability increases from 0 to 1 . The numbers shown under n are the estimated 
number of SNPs in each class (i.e., jtk* the total number of SNPs). The numbers shown under a,- are the number of SNPs identified for the 
corresponding class using the posterior probability Pr(Oj = k\X h Ni,n, V, W) > 0.9 as cutoff, (d) The true ASB patterns in simulation 2. Four 
patterns were simulated in addition to the background pattern. The four non-background patterns are shown. Each pattern has 2362 SNPs. (e) The 
BIC values for different class number K in simulation 2. The BIC achieves the minimum at K = 4. (f) The patterns discovered by iASeq in simulation 2. 



include a deviation statistic d, naive z statistic, naive Bayes 
statistic, empirical Bayes statistic and single dataset EM. 
These methods were applied to each individual dataset. 
For each dataset d, we merged data from all replicates 

to obtain x id = Eyli x idj and n id = Yljti n idj- We 
then computed the statistics used for SNP ranking as 
described below. 



1. Deviation statistic (d): SNPs were ranked based on 
\*id/nid -Pdol Here we estimated 

PdO = jr Hi:n id ^oPid r ^/:// /(/ =u n ld ' 

the number of SNPs for which n^ ^ 0. 

2. Naive z statistic (z) : SNPs were ranked based on 
= . Here pdo was estimated as in the 



rE^ot' where/ ' is 



\xid/n id -Pdo\ 



y/(PdO*(l-Pdo)/nid)' 

deviation statistic d. 
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Figure 3 The Receiver Operating Characteristic (ROC) curves for simulations, (a)-(c) We plot the number of true allele-specific SNPs (i.e., true 
positives, TP) among the top q ranked SNPs in each dataset against the rank cutoff q. Results for different methods in three representative datasets 
in simulation 1 are shown. Results in all other datasets were similar, (d) For each ranking method and each dataset, we computed the area under 
the ROC curve (AUC) using the 2000 top ranked SNPs. dAUC, the proportion of improvement of AUC brought by iASeq over the best AUC obtained 
from the single-dataset based methods, was computed for each dataset. dAUC > 0 means iASeq brings improvement. The distribution of dAUC in 
all 40 datasets is shown for simulation 1 . (e)-(g) Results in three representative datasets from simulation 2. Results in all other datasets were similar, 
(h) The distribution of dAUC in all 40 datasets is shown for simulation 2. 



3. Naive Bayes statistic (b): SNPs were ranked using 
\{Xid + 2*p d0 )/(n id + 2) -pdo\. Here 

Pdo = 7 J2i XU nu+2° where Pdo was estimated as in 
the deviation statistic d. The implicit assumption here 
is that %id\Pid ~ Bin(n id ,p id ) and p id ~ Beta(pt di fi d ) 
with a d = 2p d0 and fi d = 2(1 -p d0 )- The posterior 
mean of pi d is used to construct the ranking statistic. 

4. Empirical Bayes statistic (B): SNPs were ranked using 
\ipcid + oi d )/(n id + a d + p d ) -p d0 \. We estimated 



Pdo = 



&d+Pd 



. The implicit assumption is the same as 

the naive Bayes statistic, but now we estimate a d and 
/3 d based on the observed data using the method of 
moments as in iASeq (see Additional file 1: 
Supplemental Method S.2). 
5. Single dataset EM (singleEM) : We fitted a mixture 
model of SR, SN and NS with distributions 
fidjp(')>P = 0, 1, 2 and mixing probabilities v d , w d and 
1 — v d — w d for each dataset d without considering 
other datasets. SNPs were ranked using a posterior 
probability similar to Pi d , but now determined based 
on information in dataset d only (see Additional 
file 1: Supplemental Method S.7 for details). 

Figure 3 compares the number of true positives, TP d (q), 
in the top q SNPs reported by each method in each dataset 
d. In Figures 3a-c and 3e-g, TP d (q) is plotted as a function 



of q in a few representative datasets. These plots show 
that iASeq outperformed all single-dataset based meth- 
ods, and it was able to substantially improve the power for 
detecting allele-specificity. 

In general, the observed differences between iASeq and 
the d, z, b and B statistics could be caused by many fac- 
tors such as use of different statistical models, ranking 
statistics, or methods for parameter estimation. However, 
the comparison between iASeq and the single dataset EM 
represents a well-controlled comparison since these two 
methods used exactly the same distributional assumptions 
and parameter estimation methods. The only difference 
between them was that iASeq used information from mul- 
tiple datasets whereas singleEM was based on one dataset 
only. This well-controlled comparison shows that jointly 
modeling multiple datasets is able to improve the allelic 
inference. 

To examine whether iASeq was able to bring improve- 
ment in all datasets, we computed the Area under the 
Receiver Operating Characteristic (ROC) curves (AUC) 
for each method in each dataset using the top 2000 
ranked SNPs. In each dataset, we computed the propor- 
tion of improvement in terms of AUC brought by iASeq 
over the best single-dataset based ranking method (i.e., 
dAUC = AUC ^-AUC bestsingk dAUC > Q means . A 

s\U. ^bestsingle 

is able to bring improvement. Figures 3d and 3h show the 
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distribution of dAUC across all 40 datasets as a histogram. 
The results show that iASeq was able to improve the SNP 
ranking in almost all datasets. 

In Figure 4, we converted the iASeq posterior proba- 
bilities of top N SNPs to FDR estimates and plotted the 
estimated FDR against the true FDR. The figure shows 
that iASeq was able to provide reasonable FDR estimates 
as well. Shown in the figure are a few representative 
datasets. Results in all other datasets were similar. 

Analysis of real data 

Our simulation study demonstrates the ability of iASeq 
to discover correlation patterns of allele -specificity and 
improve the detection of skewed SNPs. Next, we applied 
iASeq to analyze the 41 real datasets (78 samples) in 
GM12878 cells. In real data, we do not have comprehen- 
sive knowledge about the truth. Therefore, unlike sim- 
ulations, we were not able to assess the FDR estimates. 
For this reason, we mainly focused on analyzing the cor- 
relation patterns of allele-specificity and testing whether 
iASeq can improve the SNP ranking. 

Correlation patterns of allele-specificity 

Figure 5a shows the BIC in the real data. Based on BIC, the 
optimal K was 2. In other words, in addition to the back- 
ground class (k = 0), iASeq discovered two other SNP 
classes, representing different allele-specificity patterns. 
For these two non-background classes, 7i> was estimated 
to be 0.0696 and 0.0691 respectively, suggesting that they 
cover 6.96% and 6.91% of the analyzed SNPs. Due to the 



background noises, not all SNPs in these two classes can 
be confidently detected. At the 0.90 posterior probabil- 
ity cutoff, iASeq reported 1868 and 2138 SNPs for classes 
1 and 2 respectively (Figure 5b). Note that our simula- 
tions had similar settings as the real data analysis, and 
they showed that iASeq was able to discover more than 
two patterns if they are supported by the data. There- 
fore our discovery of two correlation patterns here is likely 
driven by the data, that is, the information in the data 
is only sufficient for supporting robust discovery of two 
patterns. 

Figures 5b and 5c show the posterior mode of V# and 
Wk for the two non-background classes. It turned out that 
these two classes corresponded to two global directions 
of allele-specificity, SR and SN, respectively. Since the 
assignment of reference or non-reference allele depends 
on the reference genome, the assignment per se is not of 
biological interest. However, recall that GM12878 is a sin- 
gle person, therefore at each single SNP, the nucleotide 
representing the reference or non-reference allele is the 
same across all datasets analyzed here. Given this fact, 
what these results essentially tell is that at each single 
SNP, most TFs and HMs in our analysis were highly 
correlated in terms of allele-specificity, and if they are 
skewed, they tend to be skewed toward the same direc- 
tion (i.e., the same allele). For instance, for SNPs in class 1, 
both H3K4me3 (from the Broad Institute) and H3K27ac 
(Broad) had high probability to be SR, with {vkdi w kd) 
equal to (0.9337,0.0070) and (0.9730,0.0041) respectively 
(Figure 5c). The probability that one is SR and the other 
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Figure 4 Estimated FDR against true FDR in simulations, (a)-(d) Results for four representative datasets in simulation 1 . (e)-(h) Results for four 
representative datasets in simulation 2. Results for all other datasets were similar. 
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Figure 5 Correlation patterns of allele-specifkity among different TFs and HMs in GM1 2878 cells discovered by iASeq. (a) The BIC values 
for different class number K. The BIC achieves the minimum at K = 2. (b) The estimated V and W when K = 2. Each row corresponds to a class. 
Each column represents a dataset. The color in the cell (k, d) represents the SR or SN probability in class k and dataset d. From white to dark, the 
probability increases from 0 to I.The bar plot and the numbers shown under tt are the estimated number of SNPs in each class (i.e., TCk* the total 
number of SNPs). The bar plot and the numbers shown under o\ are the number of SNPs identified for the corresponding class using the posterior 
probability Pr(Oj = k\Xj,Nh n, V, W) > 0.9 as cutoff, (c) A closer look at V and W in a number of representative datasets. The barplots show the 
estimated SR and SN probabilities and in a number of selected datasets. Left: the skewing probabilities in class 1 . Right: the skewing 
probabilities in class 2. The height of each bar represents the SR or SN probability. 



one is SN was small Similarly, for SNPs in class 2, both 
H3K4me3 and H3K27ac were highly likely to be SN 
simultaneously {{v kdl w kd ) =(0.0061,0.9835) for H3K4me3 
(Broad) and (0.0040,0.9897) for H3K27ac (Broad)). While 
the allelic imbalance of most TFs and HMs were highly 
correlated, H3K27me3, a HM involved in gene repres- 
sion, was an exception. In both non-background classes, 
H3K27me3 had much lower skewing probabilities com- 
pared to the other proteins (Figure 5c). Within each class, 
the difference in the skewing probability between the two 
alleles was also much weaker for H3K27me3 as compared 
to the other proteins. For instance, in class 1, while most 
other proteins showed strong preference to be skewed 
toward the reference allele, H3K27me3 can be skewed to 
the reference allele at some SNPs and skewed to the non- 
reference allele at many other SNPs. Therefore, the allelic 
imbalance in H3K27me3 is not strongly correlated with 
the allelic imbalance of the other proteins analyzed here. 
For the genomic DNA which was used as control here, 
the skewing probabilities (v^, w k d) in both classes were 
fairly low as shown in Figure 5b-c. In both classes, the 
probability for not being skewed in the genomic DNA (i.e., 
1 — v kd — w kd) was bigger than 0.95. This indicates that the 



high probability of skewing observed in the other datasets 
was not an artifact. 

The coordinated allelic imbalance of different proteins 
toward the same allele has also been observed in a recent 
study [38]. In that study, the authors analyzed AS of 24 TFs 
and found that when multiple TFs bind to the same SNP, 
they frequently bind to the same allele. Moreover, those 
authors did not observe any pair of TFs that regularly bind 
the same position on alternate alleles. Our observation 
here therefore is consistent with their finding. 

Increased power for detecting allele-specificity compared 
with single dataset analysis 

We ranked SNPs based on the posterior probabilities P(d 
in each dataset. The iASeq ranking was compared with the 
rankings provided by the five single-dataset based meth- 
ods described above. Since we do not know the truth, 
we used two types of independent information as gold 
standard to benchmark the ranking results. 

First, we evaluated different methods by counting how 
many of their top ranked SNPs were located in the non- 
pseudoautosomal regions of chromosome X (chrX-npa) 
(Figure 6). GM12878 is a female lymphoblastoid cell line. 
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Figure 6 The ROC curves with chrX-npa SNPs as gold standard in the GM1 2878 analysis. We plot the number of non-pseudoautosomal 
region X chromosome SNPs, denoted by TP^iq), among the top q ranked SNPs in dataset d as a function of the rank cutoff q for each method. 
(a)-(g) Results in 7 representative datasets. (h) In each dataset, we computed the area under the ROC curve (AUC) using the 2000 top ranked SNPs 
for each method. dAUC, the proportion of improvement of AUC brought by iASeq over the best AUC from the single-dataset based methods, was 
computed for each dataset. The distribution of dAUC in all 40 datasets is shown. 



In GM12878, SNPs in chrX-npa are expected to be allele- 
specific due to cells rapidly become clonal in culture 
leading to a skewed X-inactivation [5,38,39]. Therefore, 
given a fixed number of top SNPs, the more chrX-npa 
SNPs one can find, the more powerful a method is. 
Figure 6 shows that iASeq clearly increased the power 
for detecting allele-specificity in each dataset compared to 
the single-dataset based analysis. For example, Figure 6a 
shows that in the H3K27ac dataset generated by the Broad 
Institute, iASeq was able to identify 122 chrX-npa SNPs 
among the top 500 SNPs. This represents 126% improve- 
ment compared to singleEM, the best single-dataset based 
ranking method in that dataset, which only identified 54 
chrX-npa SNPs. Figures 6a-g show results in a few rep- 
resentative datasets. Figure 6h shows the distribution of 
dAUC (i.e., the proportion of improvement of AUC by 
iASeq over the best single-dataset based ranking method 
in each dataset) in all 40 datasets. These plots clearly show 
that iASeq outperformed all single-dataset based meth- 
ods in all datasets and the average improvement in AUC 
was 354%. 

Second, we evaluated different methods by using inde- 
pendent RNA-seq data. From RNA-seq, one can identify 
exonic ASE SNPs and use them as gold standard. We col- 
lected two RNA-seq datasets in GM12878, one from the 
California Institute of Technology (Caltech) and the other 
from the Yale/Stanford University (Yale) (Additional file 2: 
Table SI). From each dataset, we identified the top 400 



exonic ASE SNPs using the naive Bayes statistics. Using 
the other methods to identify the gold standard ASE SNPs 
produced similar results which, for simplicity, will not be 
shown here. Based on these exonic ASE SNPs, we denned 
a SNP in our ChlP-seq analysis as truly allele-specific 
if there was an exonic ASE SNP in its Xkb neighbor- 
hood. Here we tried both X = lOkb and X = lkb and 
obtained similar results. Below we illustrate the results 
using X = lOkb as an example. The corresponding 
results for X =lkb are shown in Additional file 4: Figures 
S3-S6. Among the 94,519 SNPs analyzed in the ChlP-seq 
data, 20,526 had one or more exonic SNPs within its lOkb 
neighborhood and therefore could potentially be linked 
to an exonic ASE SNP. Figure 7 and Additional file 5: 
Figure SI compare rankings of these SNPs provided by dif- 
ferent methods in terms of how many of the top ranked 
SNPs are true positives (i.e., associated with ASE). iASeq 
again outperformed all the other single-dataset based 
ranking methods. For instance, based on the Caltech gold 
standard, iASeq on average identified 144% more true 
positive SNPs among the top 500 SNPs (Figure 7a-g). 
According to the Yale gold standard, iASeq achieved an 
average of 149% improvement in terms of the true posi- 
tive rate among top 500 SNPs (Additional file 5: Figure SI). 
The average improvement in terms of AUC (i.e., dAUC) 
across all 40 datasets was 148% (Figure 7h) and 165% 
(Additional file 5: Figure Slh) for the Caltech and Yale gold 
standard respectively. 
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Figure 7 The ROC curves in GM1 2878 data using Caltech RNA-seq ASE SNPs as gold standard. We plot TPd(q), the number of true allele- 
specific SNPs among the top q ranked SNPs in dataset d, against the rank cutoff q for each method. The true allele-specific SNPs are defined as SNPs 
that have > 1 RNA-seq exonic ASE SNPs in their 1 0kb neighborhood, (a)-(g) Results in 7 representative datasets. (h) In each dataset, we computed 
the area under the ROC curve (AUC) using the 2000 top ranked SNPs for each method. dAUC, the proportion of improvement of AUC brought by 
iASeq over the best AUC from the single-dataset based methods, was computed for each dataset. The distribution of dAUC in all 40 datasets is shown. 



To ensure that the increased statistical power was 
not completely attributed to X chromosome SNPs, we 
repeated the benchmark analysis based on RNA-seq using 
only SNPs in autosomal chromosomes, and we obtained 
similar results (Figure 8, Additional file 6: Figure S2). This 
shows that the increased power is not only contributed by 
chrX SNPs. 

Comparisons with other methods 

Most existing studies on allele-specificity were conducted 
using in-house data analysis pipelines. A tool developed 
by Skelly et al. [33] and AlleleSeq [32] are two software 
tools accessible to third-party users for AS analysis. The 
method proposed by Skelly et al. [33] is designed for ana- 
lyzing ASE in RNA-seq data. It first fits a background 
model using genomic DNA and then feeds the estimated 
parameters into a Bayesian model that combines infor- 
mation from multiple SNPs within a gene to infer ASE. 
When we applied this method to analyzing the GM12878 
ChlP-seq data, two problems occurred. First, the method 
uses Markov Chain Monte Carlo (MCMC) to fit the back- 
ground model from the genomic DNA which, as alerted 
by [33], is well-known for its slow speed and difficulties for 
users to monitor the convergence. Our genomic DNA data 
had 94,519 SNPs which covered 12,417 genes. Running 
this algorithm on this data using the parameter settings 
recommended by [33] on a machine with 2.7 GHz CPU 
and 4 Gb RAM took more than 60 days. Second, after 



feeding the background model parameters obtained from 
the first step to the inference model in the second step, 
the algorithm stopped execution after a few iterations. 
This is because the original model was developed for 
deeply sequenced RNA-seq rather than ChlP-seq, where 
the average read count covering a heterozygote SNP in 
a ChlP-seq dataset is only 0.64. As a result, the model 
developed in [33] did not fit the real data in ChlP-seq 
experiments. This lack-of-fit caused the program to stop 
early, likely due to the abnormally fitted parameters caus- 
ing various computation problems (e.g., overflow). For this 
reason, although the method proposed by [33] represents 
an advanced solution for analyzing RNA-seq ASE, it can- 
not be directly used to analyze ASB in ChlP-seq data with- 
out significantly redesigning the model and algorithm. For 
this reason, it is not further compared here. 

AlleleSeq [32] is another tool for AS analysis. It has 
been used to analyze ASB of several TFs in GM12878 
[32]. AlleleSeq is more focused on the preprocessing step. 
Its pipeline first constructs a diploid personal genome 
sequence according to family trio data and then maps 
ChlP-seq reads to this personal genome. After remov- 
ing various biases, the method then analyzes allele- 
specificity in each individual ChlP-seq dataset separately. 
[32] applied AlleleSeq to analyze 7 different TF datasets 
in GM12878, among them 5 were also included in our 
iASeq analysis. We compared iASeq and AlleleSeq using 
these same 5 datasets. We first obtained the ASB SNPs 
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Figure 8 The ROC curves in GM1 2878 data using Caltech RNA-seq autosomal ASE SNPs as gold standard. We plot TPdiq), the number of 
true allele-specific SNPs among the top q ranked autosomal SNPs in dataset d, against the rank cutoff q for each method. The true allele-specific 
SNPs are defined as autosomal SNPs that have > 1 RNA-seq exonic ASE SNPs in their 1 0kb neighborhood, (a)-(g) Results in 7 representative 
datasets. (h) In each dataset, we computed the area under the ROC curve (AUC) using the 2000 top ranked SNPs for each method. dAUC, the 
proportion of improvement of AUC brought by iASeq over the best AUC from the single-dataset based methods, was computed for each dataset. 
The distribution of dAUC in all 40 datasets is shown. 



reported by AlleleSeq from [32]. Let denote the num- 
ber of reported ASB SNPs for each TF dataset d. We next 
obtained the top SNPs ranked by iASeq. We then com- 
pared these two methods based on how many of their top 
Tj SNPs were in chrX-npa, and how many of them were 
associated with exonic ASE SNPs determined by RNA- 
seq. For the benchmark analysis based on RNA-seq, we 
associated exonic ASE SNPs with ChlP-seq SNPs using 
both lOkb and lkb neighborhood. We also performed 
the comparison after excluding the chromosome X SNPs. 
Table 1, Additional file 7: Table S3, Additional file 4: Tables 
S4-S5 and Additional file 8: Figure S7 show that iASeq 
either outperformed or performed comparable to Alle- 
leSeq in all datasets. Sometimes, the improvement was 
substantial (e.g., YaleMYC in Table 1). 

Discussion 

Interpretation of the correlation patterns 

When analyzing the real data in GM12878, iASeq found 
two non-background AS patterns, representing two oppo- 
site directions of allelic imbalance. Since the assignment 
of reference and non-reference allele depends on the refer- 
ence genome, whether a SNP is skewed toward reference 
or non-reference allele per se does not have direct biolog- 
ical meaning. What these two patterns essentially suggest 
is that the allelic imbalances of multiple proteins at a single 
SNP are correlated and have high preference to be skewed 



toward the same allele. In other words, the two patterns 
should be viewed as a pair and interpreted together. 

In general, although one may view different allelic 
imbalance patterns in iASeq as different clusters of SNPs, 
these clusters only describe the similarities among SNPs 
in terms of their skewness directions, rather than the simi- 
larities in terms of their functions. The direction is denned 
using the reference/non-reference allele. The reference or 
non-reference allele for different SNPs can have different 
meanings (e.g., for one SNP, the maternal allele may be 
the reference allele, whereas for another SNP the pater- 
nal allele may be the reference allele). Therefore within 
each cluster, even though SNPs have similar skewness pat- 
tern, they are not necessarily functionally related to each 
other. One should not confuse the SNP clusters here with 
the clusters obtained from the traditional gene expression 
microarray data analysis, where co-expressed genes in a 
cluster often have similar functions. In iASeq, the clusters 
only serve as a tool to describe the correlation structure 
among different datasets (i.e., proteins), rather than the 
functional correlation among different SNPs. The correla- 
tion patterns among datasets are used by iASeq to inform 
one how to integrate information across datasets (i.e., 
which datasets are highly correlated and therefore can 
borrow information from each other) to improve detec- 
tion of AS events for each individual SNP and dataset. In 
order to understand functions of the detected AS events, 
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Table 1 Comparison of iASeq and AlleleSeq 



Gold standard ChrX All Caltech ASE exonic SNPs Autosomal Caltech ASE exonic SNPs 



TF 


T d 


AlleleSeq 


iASeq 


T d 


AlleleSeq 


iASeq 


T d 


AlleleSeq 


iASeq 


YaleCFOS 


41 


3 


4 


9 


5 


3 


9 


5 


3 


YaleMYC 


122 


9 


22 


39 


5 


10 


38 


5 


10 


YaleJUND 


289 


13 


31 


24 


4 


8 


23 


4 


7 


YaleMAX 


105 


3 


18 


18 


3 


1 


18 


3 


2 


YalePollll 


25 


2 


2 


0 


0 


0 


0 


0 


0 



Column 1 : TF name. Column 2: Td is the number of AlleleSeq reported ASB SNPs. Columns 3-4: the number of non-pseudoautosomal region X chromosome SNPs 
among the top Td allele-specific SNPs reported by AlleleSeq and iASeq. Column 5: Td is the number of AlleleSeq reported ASB SNPs that had an exonic SNP within their 
1 0kb neighborhood. Columns 6-7 show among the top Td allele-specific SNPs reported by AlleleSeq and iASeq, how many SNPs had > 1 exonic ASE SNP in their 1 0kb 
neighborhood according to the Caltech RNA-seq experiment. Column 8: Td is the number of AlleleSeq reported autosomal ASB SNPs that had an exonic SNP within 
their 1 0kb neighborhood. Columns 9-10 show among the top Td autosomal allele-specific SNPs reported by AlleleSeq and iASeq, how many SNPs had > 1 exonic ASE 
SNP in their 1 0kb neighborhood according to the Caltech RNA-seq experiment. Additional file 7: Table S3 contains similar results using Yale RNA-seq exonic ASE SNPs 
as gold standard. 



one needs to further correlate the iASeq results with other 
information (e.g., one may determine the parent-of-origin 
of each SNP first and then study various phenomena such 
as imprinting). 

Our observation that different proteins prefer to be 
skewed in the same direction is consistent with a recent 
observation reported in [38] that AS of 24 different TFs 
are skewed toward the same allele. A number of factors 
could contribute to the observed correlation. First, bio- 
logically it is plausible that functionally related HMs and 
TFs have correlated allele-specificity. For instance, both 
H3K4me2 and H3K4me3 are markers for active transcrip- 
tion. Therefore, for a specific SNP, if the reference allele 
is associated with a gene with active transcription but the 
non-reference allele is not, then it is very likely that both 
H3K4me2 and H3K4me3 will be skewed toward the ref- 
erence allele. For another SNP, if the non-reference allele 
is transcribed but the reference allele is not, then both 
H3K4me2 and H3K4me3 will have high probability to be 
skewed toward the non-reference allele. In the genome, 
H3K4me2 and H3K4me3 are skewed toward reference 
allele for some SNPs, and skewed toward non-reference 
allele for some other SNPs. Therefore the skewed SNPs 
could naturally fall into two clusters, representing two 
opposite AS directions. Second, as pointed out by [38], the 
coordinated AS could also occur as a result of the differ- 
ence in the chromatin landscape between the two alleles. 
For instance, if the chromatin on one allele is more open 
and accessible, it could increase the overall binding prob- 
ability of multiple different proteins, leading to correlated 
allelic skewing. 

While our results show that most analyzed TFs/HMs 
tend to be skewed toward the same direction, these 
results do not imply that these proteins are perfectly 
correlated in terms of allele-specificity at each and every 
SNP. In iASeq, the correlation patterns V k and W k are 
probabilistic patterns rather than 0-1 vectors. Each corre- 
lation class k can generate all 3 D AS configurations. For 



instance, for a class with [ V k ; W k ] = [(0.9,0.9,0.9,0.1); 
(0.1,0.1,0.1,0.1)], it is possible to have one SNP with con- 
figuration [ SR, SR, NS, NS] and at the same time another 
SNP with configuration [ SR, NS, SR, NS]. Therefore, SNPs 
in the same class are not required to have the same AS 
configuration, even though they tend to have similar AS 
configurations. The probabilistic patterns are used here 
to provide a parsimonious description of the complex 
correlation structure in the data, so that one can cir- 
cumvent the difficulty of handling 3 D AS configurations 
whose complexity increases exponentially. As a conse- 
quence of using this parsimonious model, multiple weak 
correlation patterns without strong enough data support 
could be merged into a bigger class. For instance, consider 
two AS patterns [ V k ; W k ] =[ (1, 1, 0, 0); (0, 0, 0, 0)] (i.e., 
[SR,SR,NS,NS]) and [ V k ; W k ] =[ (0, 0, 1, 1); (0, 0, 0, 0)] 
(i.e., [NS,NS,SR,SR]). Suppose both patterns are equally 
likely to occur in the data. If each pattern is only associ- 
ated with a small number of SNPs, then a parsimonious 
model will prefer merging them together into one single 
class with [ V k ; W k ] =[ (0.5, 0.5, 0.5, 0.5); (0, 0, 0, 0)]. For 
this reason, iASeq only discovers correlation patterns 
that have sufficient data support so that they can be dis- 
tinguished from other patterns. It will not report weak 
patterns, which could be real but do not have enough data 
support to allow them to be robustly recovered. For users, 
this means that at the cluster level, they may not be able to 
see weak but real AS correlation patterns if these patterns 
are not associated with enough number of SNPs. On the 
other hand, for the purpose of inferring whether or not 
each SNP is allele-specific in each dataset, these parsimo- 
nious correlation patterns are sufficient for describing the 
correlation structure in the data and serving as a prior 
to guide the information sharing across datasets. The 
information sharing will lead the increased ASB detection 
power, and the eventual AS configuration at each individ- 
ual SNP will be determined by the posterior probabilities 
of (bi^Cid) (i.e., Pid) rather than the cluster-level prior 
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probabilities [ W^]. Therefore, in the final AS calls, 
the model still allows each SNP to have its own AS con- 
figuration which may not necessarily be the same as the 
AS configurations of other SNPs from the same cluster. 

Consistent with [38], in the two non-background AS 
patterns discovered here, proteins skewed toward the 
same direction did not always correspond to known 
protein-protein interactions. As pointed out by [38], this 
could happen as a result of allelic imbalances of differ- 
ent proteins being caused by a common underlying factor 
such as allelic difference in chromatin landscape. It could 
also reflect unknown protein-protein interactions. For 
iASeq specifically, there is a third reason, that is, multiple 
small patterns can be merged into a bigger probabilis- 
tic class as described before. For example, because of the 
use of probabilistic patterns, two patterns [SR,SR,NS,NS] 
and [NS,NS,SR,SR] may be merged into a single SNP 
class (e.g., [V k ; W^] =[ (0.5, 0.5, 0.5, 0.5); (0, 0, 0, 0)]). As 
a result, only looking at the pattern represented by 
[ Vkf one cannot tell the details of protein-protein 
interactions, such as these interactions only exist between 
datasets 1 and 2, or between 3 and 4, but not between 
the other pairs of datasets. What one can tell from this 
merged pattern is that, when the allelic imbalance occurs 
in these four datasets, they will be skewed toward the 
same direction, i.e., the reference allele in this example. 

In summary, while the correlation patterns in iASeq 
provide some insights on the correlation of allelic imbal- 
ance among different datasets, one should not over- 
interpret them. The primary goal of these patterns is 
to describe the correlation structure in the data so that 
information from different datasets can be shared in a 
principled way to increase the power of statistical infer- 
ence. This also points to an important difference between 
this study and previous studies that reported coordinated 
allele-specificity among multiple proteins. The previous 
studies only reported the correlation as a biological find- 
ing, but did not provide a statistical method to further 
utilize the correlation structure to improve the statisti- 
cal inference. In contrast, iASeq provides a general and 
rigorous statistical method that utilizes the automatically 
discovered correlation patterns to increase the statisti- 
cal power of AS detection. As such, it represents a novel 
development for the analysis of allele-specificity. 

Model, algorithm, and possible extensions 

Unlike tools such as AlleleSeq which mainly focus on the 
preprocessing steps for the AS analysis (e.g., construc- 
tion of diploid personal genome), iASeq is developed as a 
general model working downstream of the preprocessing 
pipelines. The input data for iASeq are the read counts in 
the format shown in Figure la. With this design, iASeq can 
be easily coupled with different data preprocessing proto- 
cols. For instance, some investigators may map their reads 



to a reference genome, while others may map their reads 
to a diploid personal genome. Both types of investiga- 
tors can use iASeq to integrate information from multiple 
datasets once they obtained the allelic read counts. 

In iASeq, we used an EM algorithm to find the posterior 
mode of parameters and carried out statistical inference 
accordingly. In principle, one may also use a full Bayesian 
approach and Markov Chain Monte Carlo (MCMC) to 
perform the posterior inference. However, since MCMC 
usually takes much longer to run for a big dataset and it 
is not easy for users to monitor convergence, we decided 
to use the posterior mode and EM-based approach in our 
implementation. For analyzing the GM 12878 data with 
94,519 SNPs, iASeq took 5 hours to run the EM algo- 
rithm to fit a single model with K = 1 on a machine with 
2.7 GHz CPU and 4Gb RAM. To fit a single model with 
K = 10 on the same machine, the EM took 16 hours. 
Running the EM for all 10 Ks between 1 and 10 on a sin- 
gle core took 4.6 days. However, when we run these 10 
jobs in parallel on 10 cluster nodes, we were able to select 
the best model within 1 day. Therefore, running the algo- 
rithm on a single machine is a little time-consuming, but 
the computation time can be reduced by parallelization. 
Also, our analysis of GM12878 data indicates that the opti- 
mal K in that real data was 2. For a K not extremely large, 
even if running the full BIC selection on a single machine 
takes some time, it usually requires less than a week, which 
is acceptable compared to the time devoted to preparing 
samples and generating data. 

In principle, the statistical model developed in iASeq 
may also be applied to analyze other types of AS events, 
such as ASE and ASM. In the future, we plan to improve 
the model by incorporating information from the spatial 
correlation among closely located SNPs. For example, for 
the ASE analysis, one may jointly model SNPs from the 
same gene, similar to [33]. 

Implications on future studies 

The analysis of AS events using the high-throughput 
sequencing data frequently faces the problem of low sta- 
tistical power due to the limited amount of information 
available at heterozygote SNPs. One way to increase the 
power is to increase the sequencing depth for one data 
type (e.g., MYC ChlP-seq). An alternative approach is to 
spend the same amount of money to generate data for 
multiple different but related data types (e.g., ChlP-seq 
for MYC, H3K4mel, H3K4me3, etc.), each with a lower 
coverage. One can then integrate the multiple datasets to 
increase the statistical power of allele-specificity analysis. 
The merit of the second approach is that one can col- 
lect multiple different types of information which might 
be useful for other purposes (e.g., in addition to study- 
ing MYC binding using MYC ChlP-seq, one may couple 
H3K4mel ChlP-seq data with DNA motif information to 
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locate active enhancers and predict binding sites of other 
TFs in the genome). If the second approach is used in the 
study design, then iASeq will offer a flexible, powerful and 
scalable framework for better analyzing the AS events in 
the data. As ChlP-seq data continue to grow rapidly, this 
integrative approach will allow us to use the data more 
efficiently to characterize the allele -specificity. 

Conclusions 

In summary, we have proposed a Bayesian hierarchi- 
cal mixture model iASeq to integrate multiple ChlP-seq 
datasets for analyzing allele-specificity. The primary goal 
of iASeq is to increase the statistical power of AS detec- 
tion, and it does so by taking the advantage of correlations 
among datasets. Since the correlation structure may not 
be known before the data analysis, iASeq learns it from the 
data automatically. Application of iASeq to the ENCODE 
GM 12878 data shows that allelic imbalance of most ana- 
lyzed TFs and HMs have strong preference to be skewed 
toward the same direction. Analysis of both the simulated 
and real data show the effectiveness of iASeq to improve 
detection of allele-specificity compared to single-dataset 
based methods. 
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